Discourse-Givenness of Noun Phrases: theoretical and computational models
Author
Abstract
ness/Specificity
News texts usually report on events involving concrete referents. Concrete referents tend to be described rather specifically, often resulting in longer expressions (length in chars). Also, embedded definite phrases or expressions with a possessive ‘s’ often have specific referents: typical possessors are e.g. persons or organizations. On the other hand, certain suffixes point to abstract common nouns, such as -ion, -ness etc. (suffix n).

Specificity is a precondition for coreference in many corpora. I understand specificity as a feature representing two components:
a) an expression’s descriptive content (preciseness of description), and
b) relatedness to the context it occurs in.

When referring to a particular object, the speaker needs to distinguish it a) from other objects in the world, and b) from other discourse referents in the context.

A distinction from other objects in the world can be accomplished by using proper names or precise descriptions (e.g. the blue book or the lexicon is more precise than the book). A distinction from other discourse referents can be made by using different concepts, or, when re-using lexical material, by adding descriptive content (e.g. the town → the new town, a bus → another bus) or by using the indefinite plural form (e.g. Investigators continued their search. At Calverton, investigators began piecing together the airplane).

For an operationalisation of specificity, I assume that
1) the more precise the description, the fewer documents it occurs in, and
2) the more important a discourse referent is for a discourse d, the more frequently it occurs in d.

As a measure for these two components of specificity, I use
1) inverse document frequency (idf) and
2) term frequency (tf),
combined to tf-idf, a well-known measure from the field of Information Retrieval.

For the purpose of discourse-givenness classification, tf-idf is calculated using the full expression as a term on the one hand, and using a sliding window of characters as a term on the other hand (4 characters in the experiments presented here). The sliding window was implemented to make the method more robust against inflection, composition, and the shortening of names (e.g. Alex for Alexander), considering in particular that it is applied not only to English.

The calculation of tf-idf-related features is sketched at the end of this section.

Similarity
Synonyms are sometimes used for referring to the same entity for stylistic reasons, e.g. to avoid word repetitions. In earlier work, latent semantic analysis and variants (Hempelmann et al., 2005) have been used, as well as semantic relatedness, measured e.g. by means of WordNet or GermaNet (Markert et al., 2012; Cahill and Riester, 2012). In the present work, semantic similarity as calculated by DISCO (Kolb, 2008) is used in the feature maxsimilar mention. The calculation of this feature is sketched at the end of this section.

Context
Motivated by the findings in the theoretical part (Section 2.3), I also introduce features exploiting an expression’s context. These contextual features are designed to complement

Distributional similarity (LSA and span) between all words of the NP and the NP’s preceding context has been used by Hempelmann et al. (2005) in their logistic regression experiments.
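
The two tf-idf variants described above (the full expression as a term, and a sliding window of 4 characters as a term) might be computed roughly as in the following Python sketch. It is an illustration under stated assumptions, not the author's implementation: the function names, the lower-casing, the idf formula log(N / df), and the averaging of the window scores over one expression are all assumptions, since the excerpt does not specify them.

```python
import math
from collections import Counter


def char_windows(text, size=4):
    """Return all overlapping character windows of `size` (lower-cased).

    Expressions no longer than `size` yield a single window (the text itself).
    """
    text = text.lower()
    if len(text) <= size:
        return [text]
    return [text[i:i + size] for i in range(len(text) - size + 1)]


def build_stats(docs, termify):
    """Count terms per document and the number of documents each term occurs in.

    docs    -- list of documents, each a list of expression strings
    termify -- function mapping one expression to a list of terms
    """
    per_doc_counts = []
    doc_freq = Counter()
    for expressions in docs:
        counts = Counter(t for e in expressions for t in termify(e))
        per_doc_counts.append(counts)
        doc_freq.update(counts.keys())  # each term counted once per document
    return per_doc_counts, doc_freq


def tfidf(term, counts, doc_freq, n_docs):
    """Assumed weighting: raw term frequency times log(N / df)."""
    tf = counts.get(term, 0)
    df = doc_freq.get(term, 0)
    if tf == 0 or df == 0:
        return 0.0
    return tf * math.log(n_docs / df)


def specificity_features(docs, doc_index, expression, size=4):
    """Compute both tf-idf variants for one expression in one document."""
    n_docs = len(docs)

    # Variant 1: the full expression is the term.
    full_counts, full_df = build_stats(docs, lambda e: [e.lower()])
    full_score = tfidf(expression.lower(), full_counts[doc_index], full_df, n_docs)

    # Variant 2: every character window is a term; scores are averaged
    # over the windows of the expression (the averaging is an assumption).
    win_counts, win_df = build_stats(docs, lambda e: char_windows(e, size))
    windows = char_windows(expression, size)
    win_scores = [tfidf(w, win_counts[doc_index], win_df, n_docs) for w in windows]

    return {
        "tfidf_full": full_score,
        "tfidf_window_mean": sum(win_scores) / len(win_scores),
    }


# Toy corpus: three documents, each given as a list of noun phrases.
docs = [
    ["Alexander", "the book", "Alex"],
    ["the book", "a bus"],
    ["the book"],
]
features = specificity_features(docs, 0, "Alex")
```

In this toy corpus, the window variant lets the shortened name Alex share the window "alex" with Alexander, so both mentions contribute to the same term count, whereas the full-expression variant treats them as unrelated terms.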
Similar resources
Using LSA to Automatically Identify Givenness and Newness of Noun Phrases in Written Discourse
Identifying given and new information within a text has long been addressed as a research issue. However, there has previously been no accurate computational method for assessing the degree to which constituents in a text contain given versus new information. This study develops a method for automatically categorizing noun phrases into one of three categories of givenness/newness, using the tax...
The Discourse Structuring Potential of Definite Noun Phrases in Natural Discourse
This paper investigates an alternation found with definite noun phrases in direct object position in Romanian that represents a theoretical puzzle for current theories of Differential Object Marking or pe-marking (Dobrovie-Sorin 1994). When in direct object position and unmodified, definite noun phrases can be accompanied either by the differential object marker pe, or by the simple enclitic de...
The Discourse Structuring Potential of Definite Noun Phrases in Romanian
This paper investigates an alternation found with definite noun phrases in direct object position in Romanian that represents a theoretical puzzle for current theories of Differential Object Marking in this language (Gramatica Limbii Române 2005, Klein & de Swart 2011). When in direct object position and unmodified, definite noun phrases can be accompanied either by the differential object mark...
Corpus-Based Identification of Non-Anaphoric Noun Phrases
Coreference resolution involves finding antecedents for anaphoric discourse entities, such as definite noun phrases. But many definite noun phrases are not anaphoric because their meaning can be understood from general world knowledge (e.g., "the White House" or "the news media"). We have developed a corpus-based algorithm for automatically identifying definite noun phrases that are non-anaphor...
Anaphora Resolution: Short-Term Memory and Focusing
INTRODUCTION Anaphora resolution is the process of determining the referent of anaphors, such as definite noun phrases and pronouns, in a discourse. Computational linguists, in modeling the process of anaphora resolution, have proposed the notion of focusing. Focusing is the process, engaged in by a reader, of selecting a subset of the discourse items and making them highly available for furth...